dplyr verbs

There are five primary dplyr verbs, representing distinct data analysis tasks:

  • filter: Extract specified rows of a data frame
  • arrange: Reorder the rows of a data frame
  • select: Select specified columns of a data frame
  • mutate: Add a new column to a data frame, or transform an existing one
  • summarise: Create collapsed summaries of a data frame
  • (group_by: Introduce structure to a data frame)
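
As a rough sketch of how the verbs chain together, here is one pipeline on the built-in mtcars data. mtcars is only a stand-in for the workshop data, and the derived column wt_kg is a hypothetical name chosen for illustration.

```r
library(dplyr)

# mtcars ships with R; it stands in for the workshop data here
verbs_demo <- mtcars |>
  filter(cyl %in% c(4, 6)) |>              # keep rows for 4- and 6-cylinder cars
  select(mpg, cyl, wt) |>                  # keep three columns
  mutate(wt_kg = wt * 1000 * 0.4536) |>    # wt is in 1000 lbs; add weight in kg
  group_by(cyl) |>                         # one group per cylinder count
  summarise(mean_mpg = mean(mpg),          # collapse each group to one row
            mean_wt_kg = mean(wt_kg),
            n = n()) |>
  arrange(desc(mean_mpg))                  # highest average mpg first
verbs_demo
```

Each verb takes a data frame and returns a data frame, which is what makes the pipe work.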

Filter

select a subset of the observations (horizontal selection)

library(dplyr)
library(tidyr)
load(here::here("data/french_fries.rda"))
french_fries |>
    filter(subject == 3, time == 1)
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid
* <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 1     1         3           1    2.9     0      0      0  
2 1     1         3           2   14       0      0      1.1
3 1     2         3           1   13.9     0      0      3.9
4 1     2         3           2   13.4     0.1    0      1.5
5 1     3         3           1   14.1     0      0      1.1
6 1     3         3           2    9.5     0      0.6    2.8
# ℹ 1 more variable: painty <dbl>
ff_long <- french_fries |> pivot_longer(potato:painty, names_to = "type", values_to = "rating")

Arrange

order the observations (hierarchically)

french_fries |>
    arrange(desc(rancid)) |>
    head()
# A tibble: 6 × 9
  time  treatment subject   rep potato buttery grassy rancid
  <fct> <fct>     <fct>   <dbl>  <dbl>   <dbl>  <dbl>  <dbl>
1 9     2         51          1    7.3     2.3      0   14.9
2 10    1         86          2    0.7     0        0   14.3
3 5     2         63          1    4.4     0        0   13.8
4 9     2         63          1    1.8     0        0   13.7
5 5     2         19          2    5.5     4.7      0   13.4
6 4     3         63          1    5.6     0        0   13.3
# ℹ 1 more variable: painty <dbl>

Select

select a subset of the variables (vertical selection)

french_fries |>
    select(time, treatment, subject, rep, potato) |>
    head()
# A tibble: 6 × 5
  time  treatment subject   rep potato
  <fct> <fct>     <fct>   <dbl>  <dbl>
1 1     1         3           1    2.9
2 1     1         3           2   14  
3 1     1         10          1   11  
4 1     1         10          2    9.9
5 1     1         15          1    1.2
6 1     1         15          2    8.8

Summarise

summarize observations into a (set of) one-number statistic(s):

french_fries |>
    summarise(
      mean_rancid = mean(rancid, na.rm=TRUE), 
      sd_rancid = sd(rancid, na.rm = TRUE)
      )
# A tibble: 1 × 2
  mean_rancid sd_rancid
        <dbl>     <dbl>
1        3.85      3.78

Summarise and group_by

french_fries |>
    group_by(time, treatment) |>
    summarise(mean_rancid = mean(rancid), sd_rancid = sd(rancid))
# A tibble: 30 × 4
# Groups:   time [10]
   time  treatment mean_rancid sd_rancid
   <fct> <fct>           <dbl>     <dbl>
 1 1     1                2.76      3.21
 2 1     2                1.72      2.71
 3 1     3                2.6       3.20
 4 2     1                3.9       4.37
 5 2     2                2.14      3.12
 6 2     3                2.50      3.38
 7 3     1                4.65      3.93
 8 3     2                2.90      3.77
 9 3     3                3.6       3.59
10 4     1                2.08      2.39
# ℹ 20 more rows

Let’s use these tools

to answer these french fry experiment questions:

  • Is the design complete?
  • Are replicates like each other?
  • How do the ratings on the different scales differ?
  • Are raters giving different scores on average?
  • Do ratings change over the weeks?

Completeness

  • If the data is complete it should have 12 x 10 x 3 x 2 = 720 records (subjects x weeks x treatments x replicates), that is, 6 records for each person in each week.

  • To check: tabulate number of records for each subject, time and treatment.
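
A minimal sketch of such a check, on a small made-up design rather than the french fries data. The names design, observed, cell_counts and incomplete are hypothetical, and one record is dropped on purpose so there is something to find.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy design: 2 subjects x 2 weeks x 3 treatments x 2 replicates
design <- expand_grid(subject = c("s1", "s2"), time = 1:2,
                      treatment = 1:3, rep = 1:2)

# Drop one record so the design is no longer complete
observed <- design |> slice(-5)

cell_counts <- observed |>
  count(subject, time, treatment) |>
  # add any cell with no records at all, with a count of zero
  complete(subject, time, treatment, fill = list(n = 0L))

# Cells with fewer than the 2 expected replicates
incomplete <- cell_counts |> filter(n < 2)
incomplete
```

The complete() step matters because count() only reports combinations that occur at least once; a cell that is missing entirely would otherwise go unnoticed.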

Work through this

How many values do we have for each subject? Check the help for the function n() with ?n

French Fries - completeness

n()

french_fries |> 
  group_by(subject) |> 
  summarize(n = n()) 
# A tibble: 12 × 2
   subject     n
   <fct>   <int>
 1 3          54
 2 10         60
 3 15         60
 4 16         60
 5 19         60
 6 31         54
 7 51         60
 8 52         60
 9 63         60
10 78         60
11 79         54
12 86         54

Other nice shortcuts

instead of group_by(subject) |> summarize(n = n()) we can use:

  • group_by(subject) |> tally()
  • count(subject)
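
To see that the three forms give the same counts, here is a quick check on the built-in mtcars data (a stand-in; cyl plays the role of subject).

```r
library(dplyr)

by_summarise <- mtcars |> group_by(cyl) |> summarise(n = n())
by_tally     <- mtcars |> group_by(cyl) |> tally()
by_count     <- mtcars |> count(cyl)

# All three hold the same counts: 11, 7 and 14 cars for 4, 6 and 8 cylinders
by_summarise$n
by_tally$n
by_count$n
```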

Counts for subject by time

french_fries |>
  na.omit() |>
  count(subject, time) |>
  pivot_wider(names_from="time", values_from="n")
# A tibble: 12 × 11
   subject   `1`   `2`   `3`   `4`   `5`   `6`   `7`   `8`
   <fct>   <int> <int> <int> <int> <int> <int> <int> <int>
 1 3           6     6     6     6     6     6     6     6
 2 10          6     6     6     6     6     6     6     6
 3 15          6     6     6     6     5     6     6     6
 4 16          6     6     6     6     6     6     6     5
 5 19          6     6     6     6     6     6     6     6
 6 31          6     6     6     6     6     6     6     6
 7 51          6     6     6     6     6     6     6     6
 8 52          6     6     6     6     6     6     6     6
 9 63          6     6     6     6     6     6     6     6
10 78          6     6     6     6     6     6     6     6
11 79          6     6     6     6     6     6     5     4
12 86          6     6     6     6     6     6     6     6
# ℹ 2 more variables: `9` <int>, `10` <int>

How do scores change over time?

ggplot(data=ff_long, aes(x=time, y=rating, colour=treatment)) +
  geom_point() +
  facet_grid(subject~type) 

Work through this

Get summary of ratings over replicates and connect the dots for a picture as below:
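
One possible approach, sketched on simulated ratings rather than the real ff_long (toy_long and toy_mean are hypothetical names, and the ratings are random numbers): average over the replicates, then draw points and connect them with lines.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for ff_long: two treatments, two replicates, two scales
set.seed(1)
toy_long <- expand.grid(time = 1:5, treatment = factor(1:2), rep = 1:2,
                        type = c("potato", "rancid"),
                        stringsAsFactors = FALSE)
toy_long$rating <- round(runif(nrow(toy_long), 0, 15), 1)

# Average over the replicates within each time/treatment/type cell
toy_mean <- toy_long |>
  group_by(time, treatment, type) |>
  summarise(rating = mean(rating, na.rm = TRUE), .groups = "drop")

# Points for the averages, lines to connect them over time
p <- ggplot(toy_mean, aes(x = time, y = rating, colour = treatment)) +
  geom_point() +
  geom_line(aes(group = treatment)) +
  facet_wrap(~type)
p
```

In the french fries data time is a factor, so the explicit group aesthetic is what tells geom_line() which points belong to the same line.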

Cleaning

Material from 5521 XXX

What is a data plot?

  • data
  • aesthetics: mapping of variables to graphical elements
  • geom: type of plot structure to use
  • transformations: log scale, …
  • layers: multiple geoms, multiple data sets, annotation
  • facets: show subsets in different plots
  • themes: modifying style

Why?

  • With the grammar, a data plot becomes a statistic.

  • It is a functional mapping from variable to graphical element. Then we can do statistics on charts!

  • With a grammar, we don’t have individual animals in the zoo, we have the genetic code that says how one plot is related to another plot.

Elements of the grammar

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(
     mapping = aes(<MAPPINGS>),
     stat = <STAT>, 
     position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>

7 key elements:

  • DATA
  • GEOM_FUNCTION
  • MAPPINGS
  • STAT
  • POSITION
  • COORDINATE_FUNCTION
  • FACET_FUNCTION
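
As an illustration, here is the template filled in with the mpg data that ships with ggplot2. The choices of geom, stat, position, coordinate system and facets are arbitrary, picked only so that every slot is visible.

```r
library(ggplot2)

p <- ggplot(data = mpg) +                 # DATA
  geom_bar(                               # GEOM_FUNCTION
    mapping = aes(x = class, fill = drv), # MAPPINGS
    stat = "count",                       # STAT
    position = "stack"                    # POSITION
  ) +
  coord_flip() +                          # COORDINATE_FUNCTION
  facet_wrap(~year)                       # FACET_FUNCTION
p
```

Most of these slots have defaults, which is why everyday ggplot2 code is usually much shorter than the full template.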

Example: Tuberculosis data

(Current) TB case notifications data from WHO.
Also available via R package getTBinR.

ggplot(tb_ind, aes(x = year, 
                  y = count, 
                  fill = sex)) +
  geom_bar(stat = "identity") +
  facet_grid(~ age_group) 
  • What do you learn about tuberculosis incidence in Indonesia from this plot?
  • Give three changes to the plot that would improve it.

Fix the plot

Manually selected fill colors; theme with white background for better contrast

# This uses a color blind friendly scale
ggplot(tb_ind, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") + 
  facet_grid(~age_group)  + 
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) + 
  theme_bw() 

Color deficiency friendly color schemes

Compare males and females

ggplot(tb_ind, aes(x=year, y=count, fill=sex)) +  
  geom_bar(stat="identity", position="fill") + 
  ylab("proportion") + 
  facet_grid(~age_group) +  
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) 

TWO MINUTE CHALLENGE

  • What do we learn about the data that is different from the previous plot?
  • What is easier and what is harder or impossible to learn from this arrangement?

Separate plots

# Make separate plots for males and females, focus on counts by category
ggplot(tb_ind, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") +
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) + 
  facet_grid(sex~age_group) + 
  theme_bw()

Make a pie

# How to make a pie instead of a barchart - not straightforward
ggplot(tb_ind, aes(x=year, y=count, fill=sex)) +
  geom_bar(stat="identity") + 
  facet_grid(sex~age_group) + 
  scale_fill_manual("Sex", values = c("#DC3220", "#005AB5")) +
  coord_polar() + 
  theme_bw()

This isn’t a pie, it’s a rose plot!

Stacked bar

# Step 1 to make the pie
ggplot(tb_ind, aes(x = 1, y = count, fill = factor(year))) +
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~age_group) +
  scale_fill_viridis_d("", option="inferno") 

Pie chart

# Now we have a pie, note the mapping of variables
# and the modification to the coord_polar
ggplot(tb_ind, aes(x = 1, y = count, fill = factor(year))) + 
  geom_bar(stat="identity", position="fill") + 
  facet_grid(sex~age_group) +
  scale_fill_viridis_d("", option="inferno") +
  coord_polar(theta = "y") 

TWO MINUTE CHALLENGE

  • What are the pros, and cons, of using the pie chart for this data?
  • Would it be better if the pies used age for the segments, and facetted by year (and sex)?

Line plot vs barchart

ggplot(tb_ind, aes(x=year, y=count, colour=sex)) +
  geom_line() + geom_point() +
  facet_grid(~age_group) +
  scale_colour_manual("Sex", values = c("#DC3220", "#005AB5")) +
  ylim(c(0,NA)) +
  theme_bw()

  • We can read counts for both sexes
  • Males and females can be directly compared
  • Temporal trend is visible

Line plot vs barchart

tb_ind |> group_by(year, age_group) |> 
  summarise(p = count[sex=="m"]/sum(count)) |>
  ggplot(aes(x=year, y=p)) +
  geom_hline(yintercept = 0.50, colour="grey50", linewidth=2) +
  geom_line() + geom_point() +
  facet_grid(~age_group) +
  ylab("Proportion of Males") +
  theme_bw()

  • Attention is forced to proportion of males
  • Direct comparison of counts within year and age
  • Equal proportion guideline provides a baseline for comparison

Your turn

Make sure you can make all the TB plots just shown. If you have extra time, try to:

  • Facet by gender, and make line plots of counts of age.
  • Show the points only, and overlay a linear model fit.

gg extensions

  • There are more than 150 extensions to ggplot2
  • Many adhere to the grammar, to define new types of displays, like
    • ggdist: including representation of uncertainty and error
    • gganimate: specifies animations as layers
  • Others are supporting packages, such as
    • patchwork for laying out multiple displays
    • ggthemes for styling plots

https://exts.ggplot2.tidyverse.org/gallery/

ggdist

tb_inc_100k <- read_csv(here::here("data/TB_burden_countries_2025-07-22.csv")) |>
  filter(iso3 %in% c("USA", "AUS"))
ggplot(tb_inc_100k, aes(y = iso3, 
                        x = e_inc_100k)) +
  stat_gradientinterval(fill = "darkorange") +
  ylab("") +
  xlab("Inc per 100k") +
  theme_ggdist()

ggplot(tb_inc_100k, aes(y = iso3, 
                        x = e_inc_100k)) +
  stat_halfeye(side = "right") +
  geom_dots(side="left", 
                    fill = "darkorange", color = "darkorange") +
  ylab("") +
  xlab("Inc per 100k") +
  theme_ggdist()

Mapping and data

Map thinning

Warping the map to display statistics

cartograms, hexagon tiling

Plots and statistical inference

Tidy data and random variables

  • Tidy data mirrors elementary statistics
  • Tabular form puts variables in columns and observations in rows
  • Not all tabular data is in this form
  • In this form, we can think about \(X_1 \sim N(0,1), ~~X_2 \sim \text{Exp}(1) ...\)

\[\begin{align}X &= \left[ \begin{array}{rrrr} X_1 & X_2 & ... & X_p \end{array} \right] \\ &= \left[ \begin{array}{rrrr} X_{11} & X_{12} & ... & X_{1p} \\ X_{21} & X_{22} & ... & X_{2p} \\ \vdots & \vdots & \ddots& \vdots \\ X_{n1} & X_{n2} & ... & X_{np} \end{array} \right]\end{align}\]

Grammar of graphics and statistics

  • A statistic is a function of the values of items in a sample, e.g. for \(n\) iid random variates \(\bar{X}_1=\displaystyle\frac{1}{n}\displaystyle\sum_{i=1}^n X_{i1}\), \(s_1^2=\displaystyle\frac{1}{n-1}\displaystyle\sum_{i=1}^n(X_{i1}-\bar{X}_1)^2\)
  • We study the behaviour of the statistic over all possible samples of size \(n\).
  • The grammar of graphics is the mapping of (random) variables to graphical elements, making plots of data into statistics
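
A small base R sketch of these ideas, using simulated N(0,1) data as an assumption: compute the sample mean (the sum divided by n) and the sample variance by hand, then look at how the mean behaves over many samples of the same size.

```r
set.seed(2024)
n  <- 50
x1 <- rnorm(n)                          # one sample of size n from N(0, 1)

xbar <- sum(x1) / n                     # sample mean
s2   <- sum((x1 - xbar)^2) / (n - 1)    # sample variance

all.equal(xbar, mean(x1))               # agrees with the built-in mean()
all.equal(s2, var(x1))                  # agrees with the built-in var()

# Behaviour of the statistic over repeated samples of size n
xbars <- replicate(1000, mean(rnorm(n)))
sd(xbars)                               # close to 1 / sqrt(n)
```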

What is inference?

Inferring that what we see in the data at hand holds more broadly in life, society and the world.

Why do we need it for graphics?

Here’s an example tweeted by David Robinson, based on an analysis in the Tick Tock blog by Graham Tierney.

To do statistical inference

You need a:

  • statistic computed from the data
  • null and alternative hypothesis
  • reference distribution on which to measure the statistic
    • if it is extreme on this scale, reject the null

Inference with data plots

You need a:

  • plot description provided by the grammar (a statistic)
    • This implies one or more null hypotheses
  • null generating mechanism, e.g. permutation, simulation from a distribution or model
  • visual evaluation: is one plot in the array different?

Some examples

Here are several plot descriptions.
What would be the null hypothesis in each?

ggplot(data) + geom_point(aes(x=x1, y=x2))            #  A
ggplot(data) + geom_point(aes(x=x1, y=x2, colour=cl)) #  B
ggplot(data) + geom_histogram(aes(x=x1))              #  C
ggplot(data) + geom_boxplot(aes(x=cl, y=x1))          #  D

Which plot definition would best match
\(H_0:\) there is no difference in the distribution between the groups?

Some examples

Here are several null hypotheses.
What type of plot would you use to test each?

  1. \(H_0:\) no association between x1 and x2
  2. \(H_0:\) no difference between levels of cl
  3. \(H_0:\) the distribution of x1 is XXX
  4. \(H_0:\) no difference in the distribution of x1 between levels of cl

Let’s do it

# Make a lineup of mtcars data
# 20 plots, one data, 19 nulls
# Which one is different?
set.seed(20190709)
library(ggplot2)
library(nullabor) # provides lineup() and null_permute()
ggplot(
  lineup(
    null_permute('mpg'), 
    mtcars), 
  aes(mpg, wt)
) +
  geom_point() +
  facet_wrap(~ .sample)

Lineup

Mix the data plot into a field of null plots

pos <- sample(1:20, 1)
df_null <- lineup(
  null_permute('v1'), 
  df, pos=pos)
ggplot(
  df_null, 
  aes(x=v2, y=v1, fill=v2)
) + 
  geom_boxplot() +
  facet_wrap(~.sample, ncol=5) + 
  coord_flip()

Which plot is different?

Null-generating mechanisms

  • Permutation: randomizing the order of one of the variables breaks association, but keeps marginal distributions the same
  • Simulation: from a given distribution, or model. Assumption is that the data comes from that model.
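
Both mechanisms can be sketched in base R on the built-in mtcars data. This is an illustration of the idea only, not the internals of any package; null_mpg and sim_mpg are hypothetical names, and the normal model for mpg is an assumption made for the example.

```r
set.seed(42)

# Permutation: shuffle one variable
null_mpg <- sample(mtcars$mpg)

# The marginal distribution is unchanged: same values, different order
identical(sort(null_mpg), sort(mtcars$mpg))

# The association with wt is broken
cor(mtcars$mpg, mtcars$wt)   # strongly negative in the data
cor(null_mpg, mtcars$wt)     # typically near zero after permuting

# Simulation: draw null values from an assumed model, here a normal
# distribution with the sample mean and standard deviation of mpg
sim_mpg <- rnorm(nrow(mtcars), mean = mean(mtcars$mpg), sd = sd(mtcars$mpg))
```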

Evaluation

  • Compute \(p\)-value
  • Power \(=\) signal strength

p-values

  • \(K\) independent observers
  • \(x\) individuals pick the data plot from \(m\) plots

Assuming that all plots in a lineup are equally likely to be selected,

\[P(X\geq x) = \sum_{i=x}^{K} \binom{K}{i} \left(\frac{1}{m}\right)^i\left(\frac{m-1}{m}\right)^{K-i}\]

This is a Binomial model

For \(x=4\) picks, \(K=17\) observers, \(m=20\) plots

library(nullabor)
pvisual(4, 17, m=20)
     x simulated  binom
[1,] 4     0.019 0.0088
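
The binom column can be checked directly in base R, since the formula is an upper-tail Binomial probability with size \(K\) and success probability \(1/m\). The variable names p_sum and p_tail are just for this sketch.

```r
x <- 4; K <- 17; m <- 20

# Sum the binomial probabilities for i = x, ..., K
p_sum <- sum(dbinom(x:K, size = K, prob = 1 / m))

# Equivalent upper-tail call: P(X >= x) = P(X > x - 1)
p_tail <- pbinom(x - 1, size = K, prob = 1 / m, lower.tail = FALSE)

round(c(p_sum, p_tail), 4)   # both round to 0.0088
```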

p-values

But… some null plots are more visually salient!

p-values

Introduce a parameter \(\alpha\) describing the distribution of visual salience across the null plots

\[\begin{align}P(X \geq x) = &\sum_{i = x}^{K} \binom{K}{i} \frac{1}{B(\alpha, (m-1)\alpha)}\times \\ &B(i+\alpha, K-i+(m-1)\alpha),\end{align}\]

where \(B(.,.)\) is the Beta function.

This is a Beta-Binomial mixture model
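
A minimal base R sketch of this tail probability, summing over \(i = x, \dots, K\) with the Beta function. The function name p_betabinom is hypothetical, and this is a direct transcription of the formula, not the implementation in any package; a package may parameterise or compute the p-value differently, so its output need not agree with this sketch.

```r
# P(X >= x) under a Beta-Binomial model, computed on the log scale
# so that very small Beta function values do not underflow
p_betabinom <- function(x, K, m, alpha) {
  i <- x:K
  log_terms <- lchoose(K, i) +
    lbeta(i + alpha, K - i + (m - 1) * alpha) -
    lbeta(alpha, (m - 1) * alpha)
  sum(exp(log_terms))
}

# Tail probability for x = 4 picks, K = 17 observers, m = 20 plots
sapply(c(0.01, 0.15, 1, 1000),
       function(a) p_betabinom(4, 17, m = 20, alpha = a))
```

For very large \(\alpha\) the Beta distribution concentrates at \(1/m\), so the result approaches the Binomial p-value.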

p-values

  • Large \(\alpha\): several null plots are ‘interesting’
  • \(\alpha \approx 0.15\): 1 or 2 null plots are interesting enough to get some picks

Computing p-values with \(\alpha\):

library(vinference)
c("alpha = 0.01" = pVis(4,17,m=20, alpha=0.01, lower.tail=FALSE),
  "alpha = 0.15" = pVis(4,17,m=20, alpha=0.15, lower.tail=FALSE),
  "alpha = 1" = pVis(4,17,m=20, alpha=1, lower.tail=FALSE))
alpha = 0.01 alpha = 0.15    alpha = 1 
       0.034        0.259        0.472 

Goodness-of-fit & residuals

  • plot is a residual vs fitted scatterplot
  • null hypothesis is no association between the two statistics
  • null generating mechanism: residual rotation
# Assessing model fit, using a lineup of residual plots: 19 nulls + 1 resid plot
# Does the residual plot show structure beyond random variation?
# Nulls are generated by `rotating` residuals after model fit.
tips <- read_csv("http://www.ggobi.org/book/data/tips.csv")
x <- lm(tip ~ totbill, data = tips)
tips.reg <- data.frame(tips, .resid = residuals(x), .fitted = fitted(x))
ggplot(lineup(null_lm(tip ~ totbill, method = 'rotate'), tips.reg)) +
  geom_point(aes(x = totbill, y = .resid)) +
  facet_wrap(~ .sample)

Goodness-of-fit & residuals

Let’s talk TB

Earlier:

  • Across all ages, and years, the proportion of males having TB is higher than females
  • These proportions tend to be higher in the middle age groups, for all years.
  • Relatively similar proportions occur across years.

Null hypothesis

Plot count against year, separately for each age group, coloured by sex.

  • Colouring by sex \(\Rightarrow\) primary comparison
  • Plot shows proportion of sex, given age group and year

\(H_0\): TB occurs equally among men and women, regardless of age and year.

\(H_A\): It doesn’t.

TB Lineup

# Make expanded rows of categorical variables matching the 
# counts of aggregated data. Sex needs to be converted to 0, 1
# to match binomial output.
tb_us_long <- uncount(tb_us, count)
tb_us_long <- tb_us_long |>
  mutate(sex01 = ifelse(sex=="m", 0, 1)) |>
  select(-sex)

# Generate a lineup of n=3, randomly choose the data position.
# Compute counts again.
pos = sample(1:3, 1)
l <- lineup(null_dist(var="sex01", dist="binom", 
                      list(size=1, p=0.5)), 
            true=tb_us_long, n=3, pos=pos)
l <- l |>
  group_by(.sample, year, age) |>
  count(sex01)

TB Lineup

ggplot(l, aes(x = year, y = n, fill = factor(sex01))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(.sample ~ age) +
  scale_fill_brewer(palette="Dark2") + 
  theme(legend.position="none")

TB Lineup

A more complicated null

\(H_0\): Rates are the same across sex, regardless of age and year.
\(H_A\): They aren’t.

# 1. Compute proportion across all data
tbl <- tb_us |> group_by(sex) |> summarise(count=sum(count))
tbl
p <- tbl$count[1]/sum(tbl$count)

# 2. Create lineup, with null data sampled from a Binomial()
#    distribution with the sample proportion as p
pos = sample(1:3, 1)
l <- lineup(null_dist(var="sex01", dist="binom",
                      list(size=1, p=p)),
            true=tb_us_long, n=3, pos=pos)
# 3. Compute aggregate results
l <- l |>
  group_by(.sample, year, age) |>
  count(sex01)
# A tibble: 2 × 2
  sex   count
  <chr> <dbl>
1 f     25915
2 m     55640

TB Lineup

ggplot(l, aes(x = year, y = n, fill = factor(sex01))) +
  geom_bar(stat = "identity", position = "fill") +
  facet_grid(.sample ~ age) +
  scale_fill_brewer(palette="Dark2") + 
  theme(legend.position="none")

TB Lineup

Danger zone

  • \(H_0\) is determined based on the plot type

  • \(H_0\) is not based on the structure seen in the data set

  • Null data creation method does not match characteristics of original sample other than that in \(H_0\)

A map lineup example

Does one map show a spatial trend?

# Read xlsx spreadsheet on cancer incidence in USA, for a more
# complex lineup example, a lineup of maps
load("data/fifty_states.rda")
incd <- read_xlsx("data/IncRate.xlsx", skip=6, sheet=2) |>
  filter(!(State %in% c("All U.S. combined", "Kansas"))) |>
  select(State, `Melanoma of the skin / Both sexes combined`) |>
  rename(Incidence=`Melanoma of the skin / Both sexes combined`) |>
  mutate(Incidence = as.numeric(substr(Incidence, 1, 3)))

# State names need to coincide between data sets
incd <- incd |> mutate(State = tolower(State))

# Choose a position 
pos <- 6

# Make lineup of cancer incidence
incd_lineup <- lineup(null_permute('Incidence'), incd, n=18, pos=pos)

# Join cancer incidence data to map polygons
incd_map <- left_join(fifty_states, filter(incd_lineup, .sample==1),
                      by=c("id"="State"))
for (i in 2:18) {
  x <- left_join(fifty_states, filter(incd_lineup, .sample==i),
                      by=c("id"="State"))
  incd_map <- bind_rows(incd_map, x)
}
# Remove Kansas - it was missing the cancer data
incd_map <- incd_map |> filter(!is.na(.sample))

# Plot the maps as a lineup
ggplot(incd_map) + 
  geom_polygon(aes(x=long, y=lat, fill = Incidence, group=group)) + 
  expand_limits(x = incd_map$long, y = incd_map$lat) +
  coord_map() +
  scale_x_continuous(breaks = NULL) + 
  scale_y_continuous(breaks = NULL) +
  labs(x = "", y = "") +
  scale_fill_viridis_b(option = "D") + 
  theme(legend.position = "none", 
        panel.background = element_blank()) +
  facet_wrap(~.sample, ncol=6)

Cancer incidence across the US 2010-2014, Melanoma cases per 100k. Data source: American Cancer Society.

Your turn

  1. run this code,
  2. look at your lineup (and only your lineup)
  3. choose a plot
  4. run the decrypt line
  5. calculate x for your group
  6. use the pvisual function to compute the p-value, with K = group size and m = the number of plots in your lineup (the code below uses n=12)
  7. Try different \(\alpha\) values with pVis - how much difference does it make?

data(wasps)
lda_pred <- function(x) {
  d <- predict(lda(Group~., 
                   data=x[,-43]))$x[,1:2] |>
  as_tibble() |>
  mutate(Group = x$Group)
  return(d)
}
wasps_lineup <- lineup(null_permute('Group'), 
                       wasps[,-1], n=12) |>
  as_tibble()
wasps_lineup_lda <- wasps_lineup |>
  split(.$.sample) |>
  map_df(~lda_pred(.)) |>
  mutate(.sample = wasps_lineup$.sample)
ggplot(wasps_lineup_lda, aes(x=LD1, y=LD2, 
                             colour=Group)) + 
  geom_point() +
  facet_wrap(~.sample, ncol=4) +
  scale_colour_brewer(palette="Dark2") +
  theme(legend.position="none")

Perceptual principles

Game: Which plot wears it better?

Coming up: 2 different plots of 2012 TB incidence (i.e. newly diagnosed cases) in Kenya, based on variables:

tb_kn |> 
  filter(year == 2012) |> 
  dplyr::select(sex, age, count) |>
  head()
# A tibble: 6 × 3
  sex   age   count
  <chr> <chr> <dbl>
1 m     15-24 17304
2 m     25-34 25460
3 m     35-44 23057
4 m     45-54 23751
5 m     55-64 20204
6 m     65+    9554
  • In arrangement A, separate plots are made for age, and sex is mapped to the x axis.
  • Conversely, in arrangement B, separate plots are made for sex, and age is mapped to the x axis.
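
The two arrangements could be built along these lines. The counts here are made up for illustration (they are not the Kenya data), bars are an assumed choice of geom, and toy, plot_a and plot_b are hypothetical names.

```r
library(ggplot2)

# Hypothetical counts, for illustration only
toy <- data.frame(
  sex   = rep(c("m", "f"), each = 3),
  age   = rep(c("15-24", "25-34", "35-44"), times = 2),
  count = c(120, 180, 150, 110, 130, 90)
)

# Arrangement A: one panel per age group, sex on the x axis
plot_a <- ggplot(toy, aes(x = sex, y = count, fill = sex)) +
  geom_col() +
  facet_wrap(~age)

# Arrangement B: one panel per sex, age on the x axis
plot_b <- ggplot(toy, aes(x = age, y = count, fill = age)) +
  geom_col() +
  facet_wrap(~sex)

plot_a
plot_b
```

Only the roles of sex and age are swapped between the two plots; the data and the geom stay the same.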

At which age(s) are the counts for males and females relatively the same?

Which plot makes this question easier to answer?

At which age(s) are the counts relatively similar across sex?

Which plot makes this easier? What do we learn from each? What’s the focus? What’s easy? What’s harder?

TWO MINUTE CHALLENGE

Write out a question that would be easier to answer from arrangement B.

Three Variables

Next, we have two different plots of TB incidence in Kenya, based on three variables:

tb_kn |> select(year, sex, age, count) |> head(10)
# A tibble: 10 × 4
    year sex   age   count
   <dbl> <chr> <chr> <dbl>
 1  1995 m     15-24   203
 2  1995 m     25-34   297
 3  1995 m     35-44   306
 4  1995 m     45-54   302
 5  1995 m     55-64   228
 6  1995 m     65+     109
 7  1995 f     15-24   160
 8  1995 f     25-34   244
 9  1995 f     35-44   282
10  1995 f     45-54   192
  • In plot type A, a line plot of counts is drawn separately by age and sex, and year is mapped to the x axis.
  • Conversely, in plot type B, counts are stacked into bar charts, separately by age and sex, and year is mapped to the x axis.

Is the trend for females generally decreasing over time? Which plot makes this easier?

Which type of plot makes it easier to answer

Is the trend for females generally decreasing over time?

TWO MINUTE CHALLENGE

What are the pros and cons of each way of displaying the same information? Should specific limits on axes be made?

Should the limits of the y axis in plot A include 0 (zero)?

TWO MINUTE CHALLENGE

Plot A shows the proportion as a line plot.
Plot B shows stacked bars scaled to 100% for females and males.

Is there an age effect in the proportion of incidence by gender? Is there a temporal trend in the proportions?

Perceptual principles

  • Hierarchy of mappings
  • Pre-attentive: some elements are noticed before you even realise it.
  • Color palettes: qualitative, sequential, diverging.
  • Proximity: Place elements for primary comparison close together.
  • Change blindness: When focus is interrupted differences may not be noticed.

Hierarchy of mappings

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

(Cleveland, 1984; Heer and Bostock, 2009)

TWO MINUTE CHALLENGE

Come up with a plot type for each of the mappings.

  1. Position - common scale (BEST)
  2. Position - nonaligned scale
  3. Length, direction, angle
  4. Area
  5. Volume, curvature
  6. Shading, color (WORST)

(Cleveland, 1984; Heer and Bostock, 2009)

Color palettes

display.brewer.all()
  • Sequential,
  • Diverging,
  • Qualitative

Color Brewer annotates palettes with attributes.
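
These attributes can be inspected from R. A short sketch, assuming the RColorBrewer package is installed; brewer.pal.info is the data frame it provides, with one row per palette.

```r
library(RColorBrewer)

# One row per palette: maximum number of colours, category
# (seq, div or qual) and a colorblind-friendly flag
head(brewer.pal.info)

# The three ColorBrewer palettes used in this section
brewer.pal.info[c("Blues", "PRGn", "Set1"), ]

# Names of all palettes flagged as colorblind friendly
rownames(subset(brewer.pal.info, colorblind))
```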

Sequential

dsamp <- diamonds |>
  sample_n(1000)
(d <- ggplot(
  dsamp, aes(carat, price)) +
  geom_point(aes(
    colour = clarity)))

  • Emphasize one side of the spectrum

  • viridis package palette

    • maps to uniform grey scale

Sequential

d + scale_colour_brewer(direction = -1)
  • Default brewer sequential scale, blues.

  • Focus is on the dark blue.

Diverging

d + scale_colour_brewer(palette="PRGn")
  • Emphasize both ends, high AND low
  • De-emphasize middle

Qualitative

d + scale_colour_brewer(palette="Set1")

Map qualitative variables to most differentiated set of colors.

It’s possible to have too many colours to perceive differences.

TWO MINUTE CHALLENGE

Of the previous four colour schemes on the same data, which would be the most appropriate? Why?

  • viridis
  • ColorBrewer sequential Blues
  • ColorBrewer Diverging PRGn
  • ColorBrewer Categorical Set1

Color blind-proofing

clrs <- hue_pal()(9)
d + theme(legend.position = "none")

clrs <- dichromat(hue_pal()(9))
d + 
  scale_colour_manual("", values=clrs) + 
  theme(legend.position = "none")
  • Online checking tool coblis: upload an image and it will re-map the colors for different colour perception issues.
  • The package colorblind has color blind friendly palettes (Susan: but the colours are awful 😭).

Color blind simulation

Original colours

Color blind view

Pre-attentive

Can you find the odd one out?

Pre-attentive

Is it easier now?

Proximity

Place elements that you want to compare close to each other. If there are multiple comparisons to make, you need to decide which one is most important.

Mapping and proximity

Same proximity is used, but different geoms.

  • Which is better to determine the relative ratios of males to females by age?

Mapping and proximity

Same proximity is used, but different geoms.

Which is better to determine the relative ratios of ages by sex?

Change blindness

ggplot(dsamp, aes(x=carat, y=price, colour = clarity)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_color_brewer(palette="Set1") +
  facet_wrap(~clarity, ncol=4)

Which has the steeper slope, VS1 or VS2?

Change blindness

Making comparisons across plots requires the eye to jump from one focal point to another.

It may result in not noticing differences.

ggplot(dsamp, aes(x=carat, y=price, 
                  colour = clarity)) +
  geom_point() +
  geom_smooth(se=FALSE) +
  scale_color_brewer(palette="Set1") 

Core principles

  • Make a plot of your data!
    • The hierarchy matters if the structure is weak or differences between groups are small.
  • Knowing how to use proximity is a valuable and rare skill
  • Use of colour: don’t over use
    • Too many colours
    • Mapping a continuous variable to colour to add another dimension

Core principles

  • Show the data!
    • Statistics are good if there’s too much data
    • Always plot the data for yourself to see the variability
  • One plot is never enough
    • Plot the data in different ways
    • Understand the relationships between variables

Your turn

This builds on the exercise from the previous session.

  • Using your choice of country, for example, Australia, make a set of plots to explore the TB incidence among males relative to females over different age groups for 2012.
  • Choose your best plot to answer this question: Is there a higher prevalence of TB among younger women in 2012?

Resources